Skip to content

Optimize the use of xml_parent() vs. xml_find_all("parent::*")#3011

Open
MichaelChirico wants to merge 3 commits intomainfrom
xml-parent
Open

Optimize the use of xml_parent() vs. xml_find_all("parent::*")#3011
MichaelChirico wants to merge 3 commits intomainfrom
xml-parent

Conversation

@MichaelChirico
Copy link
Collaborator

Closes #2497. Written with Gemini. The results were pretty strongly against using xml_parent() for "typical" expressions {lintr} will encounter, so I also went ahead and refactored existing usages into XPath equivalents.

I've folded a summary of the dialog and a few full benchmarking scripts in Details below.

Details
Digest of Gemini dialog
  We evaluated replacing pure XPath parent jumps (/parent::*) with the xml2::xml_parent() C-pointer helper across various tree sizes (up to 10,000
  expressions).

  1. The Win: Full-Tree Initial Scans
   * Target: R/source_utils.R (Function call cache initialization).
   * Change: xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL")).
   * Finding: ~42% speedup (275ms vs. 159ms).
   * Reasoning: For massive full-tree scans of highly frequent nodes, xml_parent() avoids the overhead of the XPath engine's axis evaluation for
     every single match. This is the only pattern where the optimization was consistently beneficial.

  2. The Regressions: Nodesets and Hybrid Filtering
   * Target: Linters like duplicate_argument_linter and missing_package_linter.
   * Change: Jumping to a parent and then filtering in R (e.g., via vapply) or with a secondary self::expr[...] XPath.
   * Finding: ~87% to 130% slower.
   * Reasoning: We encountered the "R-to-C Wall." Transitioning from R to the C engine multiple times—plus the overhead of R-level loops—is significantly more expensive than a single, complex C-level XPath evaluation. Pure XPath handles parent jumps and predicates in a single pass.

  3. The Regressions: Jumps on Existing Nodesets
   * Target: Linters where we already have a calculated nodeset (e.g., implicit_integer_linter).
   * Change: xml_parent(nodes) vs. xml_find_all(nodes, "parent::*").
   * Finding: ~45% slower.
   * Reasoning: Once a nodeset is already in R memory, the XPath engine is highly optimized for relative axis jumps. The overhead of calling the R helper function xml_parent() actually introduces a regression compared to a relative XPath string.
For R/source_utils.R
library(lintr)
library(xml2)
library(microbenchmark)

# Generate a file with 10,000 function calls to emphasize full-tree scan costs
generate_large_file <- function(n_calls = 10000) {
  lines <- vapply(1:n_calls, function(i) sprintf("f%d(a = %d, b = %d)", i, i, i), character(1L))
  sprintf("{\n%s\n}", paste(lines, collapse = "\n"))
}

text <- generate_large_file(10000)
temp <- tempfile()
writeLines(text, temp)
source_exprs <- get_source_expressions(temp)
xml <- source_exprs$full_xml_parsed_content %||% source_exprs$expressions[[1L]]$xml_parsed_content

bench_orig <- function() {
  res <- xml_find_all(xml, "//SYMBOL_FUNCTION_CALL/parent::expr")
  length(res)
}

bench_optimized <- function() {
  res <- xml_parent(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
  length(res)
}

cat("Benchmark (1) R/source_utils.R (Full-tree Cache Initialization):\n")
print(microbenchmark(orig = bench_orig(), optimized = bench_optimized(), times = 50))
unlink(temp)
For R/duplicate_argument_linter.R
library(lintr)
library(xml2)
library(glue)
library(microbenchmark)

# Setup
except <- c("mutate", "transmute")
ns_calls <- lintr:::xp_text_in_table(except)
xpath_orig <- glue("//EQ_SUB[not(preceding-sibling::expr/SYMBOL_FUNCTION_CALL[{ ns_calls }])]/parent::expr[count(EQ_SUB) > 1]")
xpath_hybrid_base <- glue("//EQ_SUB[not(preceding-sibling::expr/SYMBOL_FUNCTION_CALL[{ ns_calls }])]")

generate_large_file <- function(n_calls = 2000) {
  lines <- vapply(1:n_calls, function(i) {
    if (i %% 10 == 0) sprintf("list(a%d = 1, a%d = 2)", i, i)
    else sprintf("list(a%d = 1, b%d = 2)", i, i)
  }, character(1L))
  sprintf("{\n%s\n}", paste(lines, collapse = "\n"))
}

text <- generate_large_file(2000)
temp <- tempfile()
writeLines(text, temp)
source_exprs <- get_source_expressions(temp)
xml <- source_exprs$full_xml_parsed_content %||% source_exprs$expressions[[1L]]$xml_parsed_content

bench_orig <- function() {
  call_expr <- xml_find_all(xml, xpath_orig)
  length(call_expr)
}

bench_hybrid <- function() {
  # Jump to parents then filter in R
  call_expr <- xml_parent(xml_find_all(xml, xpath_hybrid_base))
  call_expr <- call_expr[vapply(call_expr, function(x) length(xml_find_all(x, "EQ_SUB")) > 1L, logical(1L))]
  length(call_expr)
}

cat("\nBenchmark (2) R/duplicate_argument_linter.R (Hybrid vs. Pure XPath):\n")
print(microbenchmark(orig = bench_orig(), hybrid = bench_hybrid(), times = 50))
unlink(temp)
For R/implicit_integer_linter.R
library(lintr)
library(xml2)
library(microbenchmark)

generate_large_file <- function(n = 5000) {
  lines <- vapply(1:n, function(i) sprintf("x <- -%d", i), character(1L))
  sprintf("{\n%s\n}", paste(lines, collapse = "\n"))
}

text <- generate_large_file(5000)
temp <- tempfile()
writeLines(text, temp)
source_exprs <- get_source_expressions(temp)
xml <- source_exprs$full_xml_parsed_content %||% source_exprs$expressions[[1L]]$xml_parsed_content
nums <- xml_find_all(xml, "//NUM_CONST")

bench_orig <- function() {
  # Pure XPath relative jump
  is_negative <- !is.na(xml_find_first(nums, "parent::expr/preceding-sibling::OP-MINUS"))
}

bench_optimized <- function() {
  # xml_parent helper jump
  is_negative <- !is.na(xml_find_first(xml_parent(nums), "preceding-sibling::OP-MINUS"))
}

cat("\nBenchmark (3) R/implicit_integer_linter.R (Nodeset Jump: helper vs. XPath):\n")
print(microbenchmark(orig = bench_orig(), optimized = bench_optimized(), times = 100))
unlink(temp)

@MichaelChirico MichaelChirico changed the title Use xml_parent() in build_xml_find_function_calls() Optimize the use of xml_parent() vs. xml_find_all("parent::*") Jan 26, 2026
@codecov
Copy link

codecov bot commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.23%. Comparing base (e6b2834) to head (d98dc21).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3011   +/-   ##
=======================================
  Coverage   99.23%   99.23%           
=======================================
  Files         128      128           
  Lines        7288     7290    +2     
=======================================
+ Hits         7232     7234    +2     
  Misses         56       56           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prefer parent::* to parent::expr where possible

1 participant